Partial least squares enhance multi-trait genomic prediction of potato cultivars in new environments

It is of paramount importance in plant breeding to have methods dealing with large numbers of predictor variables and few sample observations, as well as efficient methods for dealing with high correlation in predictors and measured traits. This paper explores in terms of prediction performance the partial least squares (PLS) method under single-trait (ST) and multi-trait (MT) prediction of potato traits. The first prediction was for tested lines in tested environments under a five-fold cross-validation (5FCV) strategy and the second prediction was for tested lines in untested environments (herein denoted as leave one environment out cross validation, LOEO). There was a good performance in terms of predictions (with accuracy mostly > 0.5 for Pearson’s correlation) the accuracy of 5FCV was better than LOEO. Hence, we have empirical evidence that the ST and MT PLS framework is a very valuable tool for prediction in the context of potato breeding data.


Results
Tables and figures are shown for two metrics: (1) correlations (ρ) between predictive genetic values and their corresponding observed values (testing set) and (2) the normalized root mean squared error of prediction (NRMSE) for single-trait ST-PLS and multi-trait MT-PLS under two cross-validation, 5FCV and LOEO (including their standard errors, SE when appropriate). Tables 1, 2 and 3 list the prediction accuracy results for three traits, total tuber weights, flesh tuber starch (%) and flesh reducing sugar, respectively. Also, the results from Tables 1, 2 and 3   Table 1. Partial least square (PLS) accuracy measured as correlation between observed and predicted values (ρ), and normalized Root Mean Squared Error (NRMSE) with their respective standard errors (SE) for genomic prediction of total tuber weight in 10  Total tuber weight. Table 1 and Fig. 1A and B show the genomic prediction accuracy based on correlation, and NRMSE for ST-PLS and MT-PLS based on 5FCV and LOEO for each of the six location-year combinations (H20 H21, M20, M21, U20, U21) and across (global) environments.
Correlations. Results show that for 5FCV, MT-PLS gave slightly higher prediction accuracy than ST-PLS in terms of correlation (ρ). Exceptions were at the prediction of environments H21 ( Flesh tuber starch. The genomic prediction accuracy measured based on correlation, and NRMSE for ST-PLS and MT-PLS for two metrics, 5FCV and LOEO, for each of the six location-year combination (H20 H21, M20, M21, U20, U21) and across all environments are given in Table 2 and Fig. 2A and B.
Correlations. In general, genomic predictions measured by correlations of trait flesh tuber starch are high (around 0.8-0.9) for most of the environments except for U21 (around 0.4). Results show that for 5FCV, MT-    In summary for most of the environment's MT-PLS gave higher prediction accuracy than ST-PLS for both correlation and NRMSE. Furthermore, results for this trait demonstrated the high prediction accuracy achieved for any metric used or any model including ST or MT but with a consistent increase of MT over ST for most of the environments.
Flesh reducing sugar. Table 3 and Fig. 3A and B had the genomic prediction accuracy based on correlation, and NRMSE for ST-PLS and MT-PLS based on 5FCV and LOEO for each of the six location-year combinations (H20 H21, M20, M21, U20, U21) and across (global) environments.  www.nature.com/scientificreports/ Tables S1-S4. On average, the PLS-based prediction for all weights as per their tuber size were smaller than those noted for total tuber yield and tuber flesh starch. The largest prediction accuracy was noted for weight of tubers below 40 mm ( Supplementary Fig. S1) or above 60 mm ( Supplementary Fig. S4). The correlations obtained by the 5FCV were mostly higher than those obtained by LOEO. The MT-PLS prediction accuracy across environments was larger than the ST-PLS, though MT-PLS prediction accuracy in some environments was smaller than respective ST-PLS, e.g. for weight of tuber below 40 mm (H20), or for weight of tubers between 40 and 50 mm (M20). ST-PLS and MT-PLS gave a zero prediction (or negative result) for weight of tubers between 50 and 60 mm in H21.

Discussion
Tuber flesh starch had the largest prediction accuracy (MT-PLS: 0.9416 ± 0.0045; ST-PLS: 0.9367 ± 0.052) under 5FCV (Fig. 2). This trait has the highest heritability (0.933) in the reference germplasm across the target population of environments of Scandinavia 29 . The prediction accuracy for total tuber weight ( Fig. 1) was larger than those observed according to size (Figs. S1-S4). As indicated by Ortiz et al. 29 total tuber weight also had larger heritability estimates (0.836) than those noted for tuber weight at different sizes (0.581-0.806). Reducing sugars in the tuber flesh had a lower prediction accuracy (Fig. 3) than both tuber flesh starch (Fig. 2) and total tuber weight ( Fig. 1), which could result from having a smaller heritability (0.778) than the other two tuber traits 29 . These results suggest that applying selection based on genomic-estimated breeding values will be effective for high heritability traits in potato.
To the best of our knowledge the MT-PLS prediction accuracy for tuber flesh starch seems to be the largest ever estimated for any characteristic in potato. As per previous research, prediction accuracy for tuber flesh starch or specific gravity ranged from 0.09 to 0.83 20 . Similarly, the MT-PLS for tuber weight and reducing sugars in the tuber flesh are above or in the high end than those noted in early research, whose ranges were 0.05-0.75 and 0.11-0.79, respectively 20 . Most of these previous prediction accuracy estimates were based mostly of ST GBLUP. The ST models are trained to predict a single trait at a time (continuous, binary, categorical or count), while MT models are trained to simultaneously predict at least two traits. MT models are preferred over ST models because they represent complex relationships between traits, and simultaneously make use of the correlations between cultivar, traits, and environments. MT are more efficient to train computationally than each ST model, they improve indirect selection because of increased precision of genetic correlation estimates between traits. MT models can increase prediction accuracy of low heritability traits that have a significant correlation with high heritability 24,30 . MT models improve parameter estimates and prediction accuracy as compared to ST models if traits are moderately correlated 24,[30][31][32][33][34] .
Adding multiple traits and multiple environments when using the PLS method for genomic prediction gives potato breeders more information that allows handling the significant genotype-by-environment interaction (GEI) that often affects tuber characteristics, particularly for total tuber yield. Prediction accuracy increases by considering GEI and correlated characteristics, thus improving the genomic selection approach for potato breeding in the target population of environments. Identifying breeding clones or cultivars according to their genomic estimated breeding values determined using PLS models that consider GEI will facilitate their further use as potential parents in potato breeding programs, thus increasing genetic gains.
The PLS method can be an alternative method for genomic prediction because it is very powerful for modelling data with inputs with large dimensionality and highly correlated; i.e., PLS naturally is able to handle more independent variables than observations that are highly correlated. PLS is the method for making good predictions in multivariate problems. Likewise, the PLS method offer high computational and statistical efficiency, as well as great flexibility and versatility in terms of the analysis problems that may be addressed 35 . For this reason the PLS method had been implemented in many areas of research for solving association and prediction problems 22,24,28,36,37 . In the case of prediction problems had been used for ST and MT predictions as well for the prediction of continuous, binary and categorical response variables. PLS originally was not proposed for association research, since the goal of the method was to find the significant linear subspace of the independent variables, not the variables themselves, but a large number of association research had been done applying PLS for variable selection. In this context, it has been used for the identification of genes associated with the considered outcome and for genome wide association study (GWAS) due to its competitive power and false discovery rates 38 .

Conclusion
The PLS method is highly suited for genomic prediction in potato breeding when high dimensional and correlated genomic and other omics data are available. However, there were not large differences observed under a ST and MT framework. Likewise, better prediction performance was obtained under the prediction problem of tested lines and tested environments (5FCV), than under the tested line and untested environments (LOEO), which was expected because the LOEO cross-validation is a difficult prediction problem. The results are very promising since one can predict most potato traits with high accuracy using the PLS framework.

Materials and methods
Multi-site testing involves six trials that included up to 256 breeding clones and released cultivars grown in Europe www.nature.com/scientificreports/ Umeå, respectively, while the rainfall ranges were 42-64 mm in Skåne and 48-75 mm in Umeå. The average daylength ranged from 11.5 h (around harvest) to 17.5 h (mid-growing season) in Skåne, and from 14.5 (harvest) to ca. 21 h (early cropping season) in Umeå. Fungicides were used against the oomycete Phytophthora infestans in Helgegården to avoid late blight in the potato crop throughout the growing season. In this way, tuber yield potential could be estimated at this testing site. Tubers used as planting material were either from SLU's Svensk potatisförädling or acquired through purchasing. Relevant institutional, national, and international guidelines and legislation were considered for field research. Crop husbandry at each site was the same used for potato farming. The characteristics evaluated were total tuber yield in a 10-plant plot (kg), tuber weight (kg) by size (< 40 mm, 40-50 mm, 50-60 mm, > 60 mm) in the 10-plant plot, while tuber flesh starch was calculated by determining specific gravity after harvest 39 . Potato glucose strip tests were used for measuring reducing sugars in the tuber flesh 40 . Heritability based on variance components, as well as genetic and phenotypic correlations (Supplementary Tables S1 and S2) were estimated following Ortiz et al. 29 Targeted genotyping -following a genotype-by-sequencing approach (https:// www. diver sitya rrays. com/ techn ology-and-resou rces/ targe ted-genot yping/) was used for characterizing 256 breeding clones and released cultivars with 2503 single nucleotide polymorphisms (SNPs), which were mostly derived after filtering SolCAP SNPs with known chromosome positions and MAF above 1% in germplasm from the Centro Internacional de la Papa (CIP, Lima, Perú) and the USA. Such a number of SNP suffices for genomic estimated breeding values without losing information 41 . The breeding clone 97 and cultivars 'Leyla' and 'Red Lady' were not included further in the genomic prediction analysis because they were lacking enough SNP data.

Single-trait partial least squares (ST-PLS) and multi-trait partial least square (MT-PLS) methods. PLS is a single-trait (ST) and multi-trait (MT) regression statistical machine learning technique intro-
duced by Wold 42 in econometrics and chemometrics. PLS is very effective for prediction problems where the number of inputs ( p) is larger than the number of observations ( n) ; i.e., under p > n problems, and also when inputs are highly correlated. This article describes the MT version of PLS, since the ST version works in a similar fashion to the MT version, except that the response variable (Y) is a vector instead of a matrix. We assumed that we had a matrix of response variables (Y) of order n × n T ( n T = numberoftraits that is related to a set of explanatory variables ( X ) of order n × p 35,43 . In PLS, instead of regressing Y on X, we regressed Y on T , where T are the latent variables (LVs), also called latent vectors or X-scores; these LVs are related to the original X and Y matrices. The goal of PLS regression is to maximize the covariance between Y and T ; however, an iterative procedure is required for its computation. The basic steps to compute the LVs under a multivariate framework using the kernel algorithm for PLS are provided below.
Step 1. Initialization of matrices, E = X and F = Y . Center each column of E and F ; scaling is optional.
Step 2. Compute S = X T Y (Cross product matrix) and then SS T = X T YY T X and S T S = Y T XX T Y.
Step 3. Compute the singular value decomposition (SVD) of SS T and S T S.
Step 4. Obtain w and q, the eigenvectors to the largest eigenvalue of SS T and S T S, respectively.
Step 5. Compute scores t and u as t = Xw = Ew and u = Yq = Fq.
Step 6. Normalize the t and u scores as t = t/ √ t T t and u = u/ √ u T u.
Step 7. Next, compute X and Y loadings as p = E T t and q = F T t.
Step 8. Deflate matrices E and F as E n+1 = E n − tp T and F n+1 = F n − tq T .
Step 9. Use as input E n+1 and F n+1 , of Step 8,in Step 2, and repeat steps 2 to 9 until the deflated matrices are empty or the necessary number of components have been extracted.
With the resulting w , t , p and q vectors, the matrices W, T, P, and Q, respectively, are built. Finally, after having all the columns of W, we compute R as: Next, with R we can compute the LVs, which are related to the original X matrix as: Next, since we regressed Y on T , the resulting beta coefficients are b = (T T T) −1 T T Y . However, to convert these back to the realm of the original variables ( X) , we pre-multiplied with matrix R the beta coefficients ( b ); since T = XR, To obtain optimal performance of the PLS method, only the first a components are used. Since regression and dimension reduction are performed simultaneously, all B , T , W , P and Q are part of the output. Both X and Y are considered when calculating the LVs in T . Thereafter, predictions for new data ( X new ) should be done with: where T new = X new R . In this application, the optimal number of components was determined by crossvalidation. We used the NRMSE, with an inner fivefold cross-validation for selecting the optimal number of hyperparameters.
In this application, we used as the input matrix or matrix of independent variables X, the concatenation of information of Environments + Genotypes + Genotypes × Environments information. For this reason, we first computed the design matrices of environments ( X E ), the design matrix of genotypes ( X g ) and the design matrix of the Genotype × Environments term ( X gE ). Note that PLS method does not allow including directly (as mixed www.nature.com/scientificreports/ models do), genomic relationship information and genotype × environment interaction: (1) genomic relationship matrix of lines K L = MM T /r where M denotes the matrix of markers (coded as 0, 1 and 2) of order J × r , J denotes the number of lines and r the total number of markers, and the (2) genotype by environment relationship matrix ( K LE = K E × K L ) , where (K E = X E X T E /I) (where I denotes the number of environments under study). Thus, to incorporate into the input matrix these relationship information's, the design matrices of lines and genotype × environments were post-multiplied by their corresponding square root matrices of their corresponding relationship matrices. That is, instead of using only X g and X gE as input, we used X g L g (with L g = K 0.5 L )and X gE L gE (with L gE = K 0.5 LE ). For this reason, the final input matrix used was X = X E , X g L g , X gE L gE . We did not post-multiply the design matrix of environments ( X E ) since K E is an identity matrix due to the fact that we did not compute an environmental relationship matrix with environmental covariates, only with the dummy values of the position of environments. For this reason, both ST and MT PLS methods were used as input of the matrix of response variables ( Y ) and the input matrix X, just defined above, but the ST PLS was fitted one at a time for each column of Y. The implementation of both ST and MT PLS models was done with the R statistical software 44 using the PLS 45 . Datasets and metrics for the evaluation of prediction accuracy. Answers to two prediction problems were pursued. The first was for tested lines in tested environments under a five-fold cross-validation (5FCV) strategy and the second was for tested lines in untested environments under a leave one environment out (LOEO) cross-validation strategy 46 . Under the 5FCV, we randomly divided the dataset into 5 subsets of similar size, using 5 − 1 = 4 subsets as the outer training set and the remaining group as the outer testing set until each of the 5 subsets played the role of outer testing set once. Since we implemented PLS method (ST and MT), we divided the respective training set into an inner training set (80% of the training set) and a validation set (20% of the training set) to be able to tune (select the optimal) the number of principal components required in the PLS method. This nested cross-validation was also implemented under fivefold cross-validation. Then, the average of the five validation sets was reported as the accuracy of the predictions to select the optimal hyperparameter (principal components that must be retained). Then with this optimal hyperparameter we refitted the PLS method and with this refitted model we performed the prediction of each outer testing set. Again, prediction performance was reported as the average of the five outer testing sets.
Similarly, under the LOEO strategy of cross-validation, I − 1 environments were assigned to the outer-training set and the remaining were assigned to the outer-testing set, until each of the I environments were tested once. Also, for tuning the hyperparameter of the PLS (ST and MT) methods, we performed a nested 5FCV strategy, that is, the outer training set was five-fold, one was used as the validation set and the remaining four as inner-training. Then, the average of the five validation folds was reported as the metric of prediction performance to select the optimal hyperparameter (number of principal components) in the ST-PLS and MT-PLS models 44 . Then, using this optimal number of hyperparameters, both PLS models (ST and MT) were refitted with the whole training set (the I − 1 environments) and finally, the prediction of each testing set (a full environment) was obtained. The 5FCV under inner and outer-cross-validation was repeated only one time. For the inner cross-validation under 5FCV and LOEO, we used as metric the normalized root mean square error ( NRMSE = RMSE y ), where , with y i denoting the observed value i , while f (x i ) represents the predicted value for observation i , with i = 1, . . . , nnumberofobservations) . We used this metric for the inner cross-validation because it is one of the most appropriate metrics for comparisons when the model is multi-trait and the response variables are on different scales, since it is not dependent on the effect of the scale of the traits. However, for reporting the final accuracy (correlation between the predicted genetic value and the phenotypic value) with the outer cross-validation in addition to the NRMSE, we also reported the average Pearson's Correlation.

Data availability
The genomic matrix K used in the models and all the data are stored at the link https:// hdl. handle. net/ 11529/ 10548 784, while the R scripts are available at: https:// github. com/ osval 78/ Potato_ 2023.